Introduction

What we do:

Our research group works in eelgrass habitats (a marine vascular plant) and often needs to estimate the quality of the eelgrass as a nearshore habitat. In order to do this we look at various morphometric measurements of the eelgrass blades themselves (leaf length) and of the meadow (density). See this website to get an understanding of what an eelgrass blade is and which measurements we’re talking about. We often will also take samples back to the lab to measure both dry and wet weight of the eelgrass to get an estimate of biomass at different sites.

What we’re intersted in:

We’re curious to see if there is a relationship between longest leaf length to total shoot biomass. The motivation for doing this is to see if we can estimate biomass without having to bring sample back to the lab and recreate measurements we didn’t take during the 2019 field season.

Data available: In 2017 21 sites were surveyed for eelgrass. At each site, 8 - \({0.25^2}\) quadrats were surveyed and 5 entire shoots (i.e. the entire plant that includes multiple leaves) were removed and brought back to the lab to be measured (in cm). Each shoot (n = 840) was weighed to get total fresh and dry biomass (in grams) and each leaf on the shoot had its length (cm) and width (cm) measured.

We are interested in the length of the longest leaf and the dry biomass. ## Read in 2017 eelgrass

eelgrass <- read.csv("~/Desktop/Grad School/UAF/Data/APECS Master repository/APECS Master repo/ALL_DATA/seagrass_biometrics_CLEAN.csv")

Data Cleaning

Calculate dry weights of each shoot and include only data collected before July (since that corresponds to the shoot leaf length data we have from 2019).

# extract what we're interested in 
str(eelgrass)
eel <- eelgrass %>%
  mutate(shoot_dw = shoot_foil_dw-shoot_foil)  # calculate dry weight
# determin max leaf length
eel2 <- eel %>% 
  rowwise() %>%
  mutate(max_length = max(leaf_length1, leaf_length2, leaf_length3, leaf_length4, leaf_length5, leaf_length6, leaf_length7, leaf_length8, leaf_length9, leaf_length10, na.rm=TRUE)) %>%
  mutate(mean_length = mean(leaf_length1, leaf_length2, leaf_length3, leaf_length4, leaf_length5, leaf_length6, leaf_length7, leaf_length8, leaf_length9, leaf_length10, na.rm=TRUE))
## Warning in max(leaf_length1, leaf_length2, leaf_length3, leaf_length4,
## leaf_length5, : no non-missing arguments to max; returning -Inf

## Warning in max(leaf_length1, leaf_length2, leaf_length3, leaf_length4,
## leaf_length5, : no non-missing arguments to max; returning -Inf

## Warning in max(leaf_length1, leaf_length2, leaf_length3, leaf_length4,
## leaf_length5, : no non-missing arguments to max; returning -Inf

## Warning in max(leaf_length1, leaf_length2, leaf_length3, leaf_length4,
## leaf_length5, : no non-missing arguments to max; returning -Inf

## Warning in max(leaf_length1, leaf_length2, leaf_length3, leaf_length4,
## leaf_length5, : no non-missing arguments to max; returning -Inf

## Warning in max(leaf_length1, leaf_length2, leaf_length3, leaf_length4,
## leaf_length5, : no non-missing arguments to max; returning -Inf

## Warning in max(leaf_length1, leaf_length2, leaf_length3, leaf_length4,
## leaf_length5, : no non-missing arguments to max; returning -Inf
str(eel2)

# subset only the data that was collected before July (the main part of the growing season)

eel_sub <- eel2 %>%
  filter(YYYYMMDD < 20170701)
str(eel_sub)
levels(as.factor(eel_sub$YYYYMMDD))
summary(eel_sub)
eel_sub2 <- na.omit(data.frame(eel_sub$shoot_dw, eel_sub$max_length, eel_sub$mean_length, eel_sub$shoot_mass_fw))
names(eel_sub2) <- c("shoot_dw", "max_length", "mean_length", "shoot_mass_fw")

Look at data

Some basic data exploration to look at normality and potential outliers. There appears to be some skew. Lets look at different possible transformations to see which looks best.

plot(eel_sub2$max_length, eel_sub2$shoot_dw)

# Looks like the variance in shoot dw might be increasing with the the length (mm) of the longest blade. 
plot(eel_sub2$mean_length, eel_sub2$shoot_dw)

plot(eel_sub2$max_length, eel_sub2$shoot_mass_fw)

# not as tight as a correlation
cor(eel_sub2$max_length, eel_sub2$shoot_dw, use = "complete.obs")
## [1] 0.8782586
cor(eel_sub2$mean_length, eel_sub2$shoot_dw, use = "complete.obs")
## [1] 0.66646
cor(eel_sub2$mean_length, eel_sub2$shoot_mass_fw, use = "complete.obs")
## [1] 0.6754984
# stronger positive linear relationship with max length than average length (makes sense)
par(mfrow=c(1,2))
boxplot(eel_sub2$max_length)
boxplot(eel_sub2$shoot_dw)

# Looks like both varibles may have some outliers, in particular the shoot dry weight. 

plot(density(eel_sub2$max_length), ylab = "Frequency")
plot(density(eel_sub2$shoot_dw, na.rm = T), ylab = "Frequency")

Looking at some of the transformed data: log transformed

hist(eel_sub2$max_length)

# max length looks pretty normal with some slight right? skew
hist(eel_sub2$shoot_dw)

# heavily right skewed does log make it look better?
hist(log(eel_sub2$shoot_dw))

# skews it the other way... 

Linear models

Data only before July

Okay lets try to fit a linear model to this data so that (later) we can calculate unknown biomass from leaf length. Based on original data.

dat <- na.omit(data.frame(eel_sub2$shoot_dw, eel_sub2$max_length))
names(dat) <- c("dw", "max_length")

plot(dat$dw, dat$max_length)

# Lets fit an untransformed linear model 
fit.lm <- lm(dw ~ max_length, data = dat)
plot(fit.lm)

# looks like there might be some funky stuff happening with the residuals -- maybe an increase in variance along the fitted values. And the QQ plot looks real gross with some concavity (i.e. skewness) and some possible outliers. The cooks plot makes it look like there case numbers (171, 230, 498) could be influencial outliers. Some of this is consisten with the histogram of the data (see above)
summary(fit.lm)
## 
## Call:
## lm(formula = dw ~ max_length, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20351 -0.04010 -0.00246  0.02434  0.52698 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0917116  0.0076619  -11.97   <2e-16 ***
## max_length   0.0062482  0.0001505   41.52   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07254 on 511 degrees of freedom
## Multiple R-squared:  0.7713, Adjusted R-squared:  0.7709 
## F-statistic:  1724 on 1 and 511 DF,  p-value: < 2.2e-16
AIC(fit.lm); BIC(fit.lm)
## [1] -1232.068
## [1] -1219.347
# looks like max length is significantly different than 0
# Has an adjusted r2 of 69%... pretty good for ecology. Model is significant, coefficients for the model are not equal to zero (p value =~0)
library(e1071)
skewness(dat$dw)
## [1] 1.654238
skewness(dat$max_length)
## [1] 0.8226692
# the response variable is skewed.... highly skewed

e <- residuals(fit.lm)
shapiro.test(e)
## 
##  Shapiro-Wilk normality test
## 
## data:  e
## W = 0.87574, p-value < 2.2e-16
n <- length(dat$dw)

ggplot(dat, aes(x = max_length, y = dw)) + geom_point() + theme_classic() +
  geom_smooth(method = "lm", se = TRUE)

# Does a boxcox indicate a transformation is necessary? 
library(MASS)
boxcox(fit.lm)

boxcox(fit.lm, lambda=seq(from=0, to=0.6, by=.01)) 

# Umm of like square root?
# Could also try a log



hist(dat$dw^0.5)

y.2 <- dat$y^(0.5)
# looks way better with a fourth root transformation, normal distribution
fit2 <- lm(dw^(0.5) ~ max_length, data = dat)
plot(fit2)

# looks a lot better, homoscedascity is good, qq plot tails are a little weird, still some outliers
summary(fit2)
## 
## Call:
## lm(formula = dw^(0.5) ~ max_length, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19936 -0.04459 -0.00439  0.03367  0.41670 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.1044604  0.0071964   14.52   <2e-16 ***
## max_length  0.0067145  0.0001414   47.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06813 on 511 degrees of freedom
## Multiple R-squared:  0.8154, Adjusted R-squared:  0.815 
## F-statistic:  2256 on 1 and 511 DF,  p-value: < 2.2e-16
# r squared is better, model is still significant
plot(dat$max_length, dat$dw^(0.5))

# Plot the square root transformed data
lm_eqn <- function(dat){
    m <- fit2;
    eq <- substitute(italic(sqrt(y)) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, 
         list(a = format(unname(coef(m)[1]), digits = 2),
              b = format(unname(coef(m)[2]), digits = 2),
             r2 = format(summary(m)$r.squared, digits = 3)))
    as.character(as.expression(eq));
}
ggplot(dat, aes(x = max_length, y = dw^(0.5))) + geom_point() + geom_smooth(method = "lm", se = TRUE, level = 0.95) + geom_text(x = 25, y = 0.8, label = lm_eqn(dat), parse = TRUE, check_overlap = TRUE) + theme_classic()

# Compare to a logtransformed model (easier to understand)

fit3 <- lm(log(dw) ~ max_length, data = dat)
plot(fit3)

summary(fit3)
## 
## Call:
## lm(formula = log(dw) ~ max_length, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.26096 -0.21548  0.03841  0.26030  1.43709 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.4469963  0.0443016  -77.81   <2e-16 ***
## max_length   0.0331391  0.0008702   38.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4194 on 511 degrees of freedom
## Multiple R-squared:  0.7395, Adjusted R-squared:  0.739 
## F-statistic:  1450 on 1 and 511 DF,  p-value: < 2.2e-16
hist(log(dat$dw))

plot(dat$max_length, log(dat$dw))

# log-log transformation
lm2 <- lm(log(shoot_dw) ~ log(max_length), data = eel_sub2)
summary(lm2)
## 
## Call:
## lm(formula = log(shoot_dw) ~ log(max_length), data = eel_sub2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0858 -0.2056  0.0128  0.1736  1.2739 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -7.78423    0.11222  -69.37   <2e-16 ***
## log(max_length)  1.57549    0.02987   52.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3237 on 511 degrees of freedom
## Multiple R-squared:  0.8448, Adjusted R-squared:  0.8445 
## F-statistic:  2781 on 1 and 511 DF,  p-value: < 2.2e-16
plot(lm2)

# plot the log-log transformed data with linear model 
lm_eqn <- function(dat){
    m <- lm2;
    eq <- substitute(italic(log(y)) == a + b %.% italic(log(x))*","~~italic(r)^2~"="~r2, 
         list(a = format(unname(coef(m)[1]), digits = 2),
              b = format(unname(coef(m)[2]), digits = 2),
             r2 = format(summary(m)$r.squared, digits = 3)))
    as.character(as.expression(eq));
}
ggplot(dat, aes(x = log(max_length), y = log(dw))) + geom_point() + geom_smooth(method = "lm", se = TRUE, level = 0.95) + geom_text(x = 2.5, y = -0.5, label = lm_eqn(dat), parse = TRUE, check_overlap = TRUE) + 
  theme_classic()

visreg(lm2, gg = T)

Use selected linear model to predict 2019 biomass data

Import 2019 data

In field season 2019 we only collected in the field measurements of eelgrass shoots (not biomass). The goal of this section is to calculate the total biomass in a \(0.25m^2\) quadrat and then to extrapolate to \(1m^2\).

The data for 2019: is set up in a 3 sheet excel spreadsheets. quadrat_ID in hab_qrt is unique for every quadrat that was done in summer 2019. Use the quadrat_ID to connect to the other sheets (hab_lng and hab_wgt). Hab_lng has all the individual lengths for all measured plants (in mm) (for eelgrass its 15 blades from each quadrat/site. For kelp its up to 3 of each species collected from the site). Hab_wgt has the biomass weights for individual species biomass by bag (grams)

For this purpose we are interested in only the length data but need to use the quadrat sheet to make sure we only are looking at the eelgrass sites. Newest version of data is : RAW_10-22-19

hab_qrt <- read_excel("~/Desktop/Grad School/UAF/Data/APECS Master repository/APECS Master repo/ALL_DATA/Lia_fish/Habitat_2019_RAW_10-23-19.xlsx", sheet = 1)
hab_lng <- read_excel("~/Desktop/Grad School/UAF/Data/APECS Master repository/APECS Master repo/ALL_DATA/Lia_fish/Habitat_2019_RAW_10-23-19.xlsx", sheet = 2)
hab_wgt <- read_excel("~/Desktop/Grad School/UAF/Data/APECS Master repository/APECS Master repo/ALL_DATA/Lia_fish/Habitat_2019_RAW_10-23-19.xlsx", sheet = 3)

Data cleaning

We want only the eelgrass sites and need to convert the length measurements (in mm) to cm

glimpse(hab_qrt)
## Observations: 125
## Variables: 10
## $ quadrat_ID       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ site             <chr> "Shakan - kelp", "Shakan - kelp", "Shakan - kel…
## $ date             <dttm> 2019-06-08, 2019-06-08, 2019-06-08, 2019-06-08…
## $ YYYYMMDD         <dbl> 20190608, 20190608, 20190608, 20190608, 2019060…
## $ habitat          <chr> "kelp", "kelp", "kelp", "kelp", "kelp", "seagra…
## $ quadrat          <dbl> 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1,…
## $ total_biomass    <chr> "620.9", "230.4", "869.4", "716.0", "748.7", "N…
## $ density          <chr> "NA", "NA", "NA", "NA", "NA", "29.0", "18.0", "…
## $ flowering_shoots <chr> "NA", "NA", "NA", "NA", "NA", "0.0", "0.0", "0.…
## $ notes            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
# Want to only extract the quadrat numbers that were used to sample eelgrass sites
eel_qrts <- na.omit(ifelse(hab_qrt$habitat == "seagrass", paste(hab_qrt$quadrat_ID), NA))
# subset the length data by quadrat at an eelgrass site
lng_sub <- subset(hab_lng, quadrat_ID %in% eel_qrts)
levels(as.factor(lng_sub$species)) # only Z. marina, awesome! 
## [1] "Zostera marina"
lng_sub <- lng_sub %>%
  mutate(length_cm = length/100)

Predict based on linear models

summary(lm2)
## 
## Call:
## lm(formula = log(shoot_dw) ~ log(max_length), data = eel_sub2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0858 -0.2056  0.0128  0.1736  1.2739 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -7.78423    0.11222  -69.37   <2e-16 ***
## log(max_length)  1.57549    0.02987   52.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3237 on 511 degrees of freedom
## Multiple R-squared:  0.8448, Adjusted R-squared:  0.8445 
## F-statistic:  2781 on 1 and 511 DF,  p-value: < 2.2e-16
# formula :
#   log (y) = beta0 + beta1 * log (x) 
#   log (dw) = -7.78 + 1.57 * log (max_length) 
# create a new dataframe with the leaf lengths from 2019
newx <- data.frame(max_length = as.numeric(lng_sub$length_cm))
# use the log-log linear model (equation above) to calculate the dry mass of the shoot with 95% CI 
# make sure to exp() the values because its a log-log relationship
pr.lm <- exp(predict(lm2, newdata = newx, interval = "confidence", level = 0.95))

# graph these new predicted values
# create data.frame first
newdata <- cbind(lng_sub, pr.lm)
ggplot(newdata, aes(x = length_cm, y = fit)) + geom_point() + theme_classic() + geom_smooth(method = "lm")

Estimate total biomass

str(newdata)
## 'data.frame':    1503 obs. of  8 variables:
##  $ plant_ID  : num  27 28 29 30 31 32 33 34 35 36 ...
##  $ quadrat_ID: num  6 6 6 6 6 6 6 6 6 6 ...
##  $ species   : chr  "Zostera marina" "Zostera marina" "Zostera marina" "Zostera marina" ...
##  $ length    : num  579 835 375 334 691 323 581 418 852 544 ...
##  $ length_cm : num  5.79 8.35 3.75 3.34 6.91 3.23 5.81 4.18 8.52 5.44 ...
##  $ fit       : num  0.00662 0.01179 0.00334 0.00278 0.00875 ...
##  $ lwr       : num  0.00588 0.01069 0.00289 0.00239 0.00785 ...
##  $ upr       : num  0.00746 0.01301 0.00386 0.00324 0.00976 ...
str(hab_qrt)
## Classes 'tbl_df', 'tbl' and 'data.frame':    125 obs. of  10 variables:
##  $ quadrat_ID      : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ site            : chr  "Shakan - kelp" "Shakan - kelp" "Shakan - kelp" "Shakan - kelp" ...
##  $ date            : POSIXct, format: "2019-06-08" "2019-06-08" ...
##  $ YYYYMMDD        : num  20190608 20190608 20190608 20190608 20190608 ...
##  $ habitat         : chr  "kelp" "kelp" "kelp" "kelp" ...
##  $ quadrat         : num  1 2 3 4 5 1 2 3 4 5 ...
##  $ total_biomass   : chr  "620.9" "230.4" "869.4" "716.0" ...
##  $ density         : chr  "NA" "NA" "NA" "NA" ...
##  $ flowering_shoots: chr  "NA" "NA" "NA" "NA" ...
##  $ notes           : chr  NA NA NA NA ...
hab_qrt$density <- as.numeric(hab_qrt$density)
## Warning: NAs introduced by coercion
hab_qrt$flowering_shoots <- as.numeric(hab_qrt$flowering_shoots)
## Warning: NAs introduced by coercion
# Want to only extract the quadrat numbers that were used to sample eelgrass sites
dsty <- hab_qrt %>%
  filter(habitat == 'seagrass') %>%
  mutate(density_m2 = (density)*4) %>%
  mutate(flowering_m2 = (flowering_shoots)*4)
dsty <- dsty[,c(1,8,9,11,12)]

# calculate average biomass for 15 shoots for each quadrat
require(dplyr)
df <- newdata %>%
  group_by(quadrat_ID) %>%
  mutate(avg_biomass = mean(fit))
df1 <- df %>%
  dplyr::select(quadrat_ID, avg_biomass) %>%
  distinct()

# need to add back in site information

df2 <- left_join(df1, dsty, by = "quadrat_ID")
df3 <- left_join(df2, hab_qrt[,1:2], by = "quadrat_ID")

# calculate average biomass per site based on biomass by blade and density

biom <- df3 %>%
  mutate(total_biom = avg_biomass * density_m2) %>%
  group_by(site) %>%
  mutate(total_biom_by_site = sum(total_biom)) %>%
  dplyr::select(site, total_biom_by_site) %>%
  distinct()

# look at the data table
DT::datatable(biom, caption = "Cacluated eelgrass biomass")
#write.csv(df3, "../ALL_DATA/seagrass_biomass_conversions.csv")